Abstract

I describe an approach to similarity motivated by Bayesian
methods. This yields a similarity function that is learnable
using standard Bayesian methods. The relationship of the
approach to variable kernel and variable metric methods is
discussed. Experimental results on character recognition and
3D object recognition are presented.



1 Introduction 

Visual object recognition, character recognition, speech
recognition, and a wide variety of statistical and engineering
problems involve classification. Classifiers attempt to
assign class labels to novel, unlabeled data based on previously
seen labeled training data. For example, determining
the identity of a letter (the "class") from a scanned image
of the letter (the "feature vector") is a classification
problem occurring in optical character recognition (OCR).

Two very common approaches to solving classification
problems are Bayesian methods and nearest neighbor methods.
In Bayesian methods, we model class conditional distributions
and use those estimates for finding minimum error
rate discriminant functions. In nearest neighbor methods,
we classify unknown feature vectors based on their
proximity in feature space (usually, some Euclidean space,
$\mathbb{R}^n$) to previously classified samples.

Nearest neighbor methods can actually be viewed as a
special case of Bayesian methods if we view the nearest
neighbor procedure as implicitly using a non-parametric
approximation of class conditional densities. Asymptotically,
the error rate of nearest neighbor procedures is known to
be within a factor of two of the Bayes optimal error rate
[?]. Just as important as the asymptotic error rate is how
quickly the error rate of a classifier decreases with increasing
amounts of training data.

To achieve improvements in these areas, a number of authors
(e.g., [15], [8], [10], [18], [3]) have proposed using similarity
functions other than the Euclidean distance in nearest neighbor
classification, and give on-line or off-line procedures for
computing such similarity functions.¹ Another recent development
is the increased demand in applications for sound
ways of determining the "similarity" of two objects in areas
like 3D visual object recognition, biometric identification,
case based reasoning, and information retrieval (e.g., [?]).

This paper describes a notion of similarity that is di- 
rectly grounded in Bayesian statistics and that is learnable 
based on training examples using a wide variety of well-known
density estimation methods and classifiers. It then
discusses the relationship between such a notion of statis- 
tical similarity and nearest neighbor classification. The ap- 
proach is motivated with several examples. Experiments 
on learning character recognition and 3D object recognition 
are discussed. 

2 Some Bayesian Decision Theory 

To establish notation and background, let us briefly review
a few aspects of Bayesian decision theory relevant to
classification problems. Bayesian decision theory [1,4] tells us
that the approach for finding minimum error rate solutions to
classification problems is the following.² Let $\Omega$ be a finite set
of possible classes. Let our feature vectors $x$ be vectors in
$\mathbb{R}^n$. First, estimate the posterior probabilities $P(\omega|x)$.
Then, choose the class $\omega \in \Omega$ that has the maximum posterior
probability given the input data $x \in \mathbb{R}^n$.

The differences among different classification methods
come down to different tradeoffs and approaches in estimating
$P(\omega|x)$. $P(\omega|x)$ is usually estimated from a large set of
training samples $\{(x_1,\omega_1), \ldots, (x_n,\omega_n)\}$, the training set.
Here, the $x_i$ are measurements or feature vectors, and the
$\omega_i$ are the corresponding classes.

¹ They are often referred to as "adaptive similarity metrics", but they do
not satisfy the metric axioms; to avoid confusion, we refer to them here
as "similarity functions".

² Without loss of generality, we consider only minimization of the expected
loss under a zero-one loss function in this paper.

One of the most common ways of estimating $P(\omega|x)$ is
to estimate $P(x|\omega)$ and then apply Bayes rule:

$$P(\omega|x) = \frac{P(x|\omega)\,P(\omega)}{P(x)} \qquad (1)$$



For example, if samples $x$ are generated by picking a
per-class prototype and adding Gaussian random noise
$N \sim G(0, \Sigma)$ to it, then $x \sim x_\omega + N$ or, equivalently,
$P(x|\omega) = G(x_\omega, \Sigma)$. This may be extended to allowing
multiple prototypes per class, giving mixture of Gaussian
models $P(x|\omega) = \sum_i G(x_{\omega,i}, \Sigma)$. Such parametric models
form the basis of many applications of classification in control
theory and speech recognition. They are attractive because
we can often derive the distribution of the noise from first
principles and estimate the parameters of the noise distribution
using closed-form approaches.
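The single-prototype Gaussian model and Equation 1 can be illustrated with a minimal sketch. It assumes a one-dimensional feature, a shared variance, and hypothetical prototype positions and priors; none of these values come from the experiments in this paper.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior(x, prototypes, priors, var):
    """Return P(omega | x) for every class via Bayes rule (Equation 1)."""
    # Numerator of Bayes rule: P(x | omega) * P(omega) for each class.
    joint = {w: gaussian_pdf(x, prototypes[w], var) * priors[w] for w in prototypes}
    # Denominator: P(x), obtained by summing the numerators over all classes.
    evidence = sum(joint.values())
    return {w: joint[w] / evidence for w in joint}

# Hypothetical two-class problem: prototypes at 0 and 2, equal priors, unit variance.
prototypes = {"a": 0.0, "b": 2.0}
priors = {"a": 0.5, "b": 0.5}
post = posterior(1.0, prototypes, priors, var=1.0)
print(post)
```

A point halfway between two equally likely prototypes receives equal posterior mass for both classes, which is the situation in which a feature vector cannot be classified unambiguously.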

Another approach to modeling $P(x|\omega)$ is that of many-parameter
or non-parametric density estimation, using,
for example, multi-layer perceptrons, logistic regression,
Parzen windows, and many other techniques. In essence,
this is a special case of a function interpolation problem,
where the distribution is to be interpolated based on training
samples. In fact, in many cases, estimating the posterior
$P(\omega|x)$ directly can be solved by least squares linear
regression on the data set $\{(x_1,y_1), \ldots, (x_n,y_n)\}$, where as
the regression variable $y_i$ we pick the value of the indicator
function $y_i = [\omega = \omega_i]$, that is, we set $y_i$ to 1 if
$\omega_i = \omega$ and to 0 otherwise.
Logistic regression and classification using multi-layer
perceptrons are closely related to such an approach.
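The regression on indicator targets described above can be sketched as follows. This is a minimal illustration assuming a scalar feature and a straight-line model with intercept; the function name and the toy data are hypothetical.

```python
def fit_indicator_regression(xs, labels, target_class):
    """Least squares fit of y = intercept + slope * x to indicator targets.

    y_i is 1 when the i-th label equals target_class and 0 otherwise.
    Returns a function that maps x to the fitted value.
    """
    ys = [1.0 if w == target_class else 0.0 for w in labels]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

# Hypothetical toy data: class "a" at small x, class "b" at large x.
xs = [0.0, 1.0, 2.0, 3.0]
labels = ["a", "a", "b", "b"]
estimate_b = fit_indicator_regression(xs, labels, "b")
print(estimate_b(1.5), estimate_b(0.0))
```

The fitted value rises with the evidence for class "b", but a linear fit is not constrained to the interval [0, 1] (here it is slightly negative at x = 0), which is one reason logistic regression or multi-layer perceptrons are often preferred.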

Since, for classification under a given loss function, we
are only interested in $\arg\max_\omega P(\omega|x)$, many approximations
to the posterior density are equivalent from the point
of view of classification. This can be expressed by saying
that instead of estimating densities, we attempt to find
decision functions $D_\omega(x)$ such that classifying according
to $\omega(x) = \arg\max_\omega D_\omega(x)$ results in minimum error rates.
Such an approach is taken by, for example, linear discriminant
analysis and support vector machines (it has been argued
that this relaxation of the density approximation problem
results in lower error rates).

3 Bayesian Similarity and Classification

The motivation for the Bayesian similarity model introduced
in this paper is the following. Assume we are performing
nearest neighbor classification. We are given a prototype
$x'$ together with its class label $\omega'$ and an unknown
vector $x$ to be classified. If we could estimate the probability
that $x$ and $x'$ represent the same class, then we could use
this to determine the probability that vector $x$ comes from
class $\omega'$.

Let us write this probability as $P(S|x,x')$, where $S$ is
a binary variable, $S = 1$ if $x$ and $x'$ come from the same
class, and $S = 0$ otherwise. We can express $P(S|x,x')$ in
terms of $P(\omega|x)$ and $P(\omega|x')$ and use this as the definition
of Bayesian statistical similarity.

Definition 1 The (Bayesian) statistical similarity function
$S(x,x')$ is the conditional distribution $P(S|x,x')$, where

$$S(x,x') = P(S=1|x,x') = \sum_{\omega \in \Omega} P(\omega|x)\,P(\omega|x') \qquad (2)$$

Note that in this definition, the distributions $P(\omega, x)$ and
$P(\omega, x')$ need not be the same.

There are a number of properties we should observe. 
First, statistical similarity functions assume values in the 
interval [0,1]. Also, statistical similarity is dependent to 
some degree on the classification problem we are consid- 
ering (although we will see below that statistical similarity 
can generalize to a wider variety of classification problems 
than, say, a set of discriminant functions). A value of 1 
means that two feature vectors x and x' are known to be 
in the same class. However, S(x, x) can be less than one, 
namely when the feature vector x cannot be classified un- 
ambiguously. 
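These properties can be checked numerically with a minimal sketch of Equation 2. Posteriors are represented here as dictionaries from class label to probability; that representation and the example values are assumptions made for illustration.

```python
def statistical_similarity(post_x, post_xp):
    """S(x, x') = sum over classes of P(omega|x) * P(omega|x') (Equation 2)."""
    return sum(post_x[w] * post_xp.get(w, 0.0) for w in post_x)

# Hypothetical posteriors for three feature vectors in a two-class problem.
certain_a = {"a": 1.0, "b": 0.0}   # known to belong to class "a"
certain_b = {"a": 0.0, "b": 1.0}   # known to belong to class "b"
ambiguous = {"a": 0.5, "b": 0.5}   # cannot be classified unambiguously

print(statistical_similarity(certain_a, certain_a))
print(statistical_similarity(certain_a, certain_b))
print(statistical_similarity(ambiguous, ambiguous))
```

The last call shows the case noted above: the similarity of an ambiguous feature vector with itself is below one, because two independent draws of its class would agree only half of the time.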

Now that we have a formal expression for $P(S|x,x')$,
let us look at the classification rule. For this, we first need
another definition.

Definition 2 Given some $\omega \in \Omega$, let us call $x_\omega$ an
unambiguous exemplar for class $\omega$ iff $P(\omega|x_\omega) = 1$; because of
normalization, this also means that $P(\omega'|x_\omega) = 0$ when
$\omega' \neq \omega$, or $P(\omega'|x_\omega) = \delta(\omega', \omega)$.

If $x_0$ is an unambiguous exemplar for class $\omega_0$, then

$$\begin{aligned}
P(S=1|x,x_0) &= \sum_\omega P(\omega|x)\,P(\omega|x_0) && (3)\\
&= \sum_\omega P(\omega|x)\,\delta(\omega,\omega_0) && (4)\\
&= P(\omega_0|x) && (5)
\end{aligned}$$

Therefore, we have shown the following:

Theorem. If $x_0$ is an unambiguous exemplar for class
$\omega_0$, then $P(\omega_0|x) = P(S=1|x,x_0)$.
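The theorem can be checked on a small example. The sketch below assumes three hypothetical classes and represents each unambiguous exemplar by its one-hot posterior; the posterior chosen for the unknown vector is arbitrary.

```python
def similarity(post_x, post_xp):
    """P(S=1 | x, x') as the sum of products of class posteriors."""
    return sum(post_x[w] * post_xp[w] for w in post_x)

classes = ["a", "b", "c"]

# An unambiguous exemplar for class w0 has posterior delta(w, w0).
exemplar_posteriors = {
    w0: {w: 1.0 if w == w0 else 0.0 for w in classes} for w0 in classes
}

# Hypothetical posterior of the unknown vector x.
post_x = {"a": 0.2, "b": 0.7, "c": 0.1}

# Similarity of x to each exemplar reproduces the posterior of that class.
sims = {w0: similarity(post_x, exemplar_posteriors[w0]) for w0 in classes}

# Picking the most similar exemplar is the same as picking the MAP class.
nearest_exemplar_class = max(sims, key=sims.get)
bayes_class = max(post_x, key=post_x.get)
print(sims, nearest_exemplar_class, bayes_class)
```

In this setting a nearest neighbor rule over unambiguous exemplars, using statistical similarity in place of a distance, makes the same decision as the maximum posterior rule.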

The point of these derivations was to make a connection
between statistical similarity functions and Bayesian
decision theory. Overall, what this shows is that we can
represent $P(\omega|x)$ as a statistical similarity function
$P(S=1|x,x')$ and a set of unambiguous exemplars $\{x_\omega\}$. It gives
us a prescription for constructing a nearest neighbor classifier
for many kinds of classification problems that is guaranteed
to achieve the Bayes optimal error rate.

Of course, not all classification problems have unam- 
biguous exemplars; an analysis of such cases goes beyond 
the scope of this paper, and it is probably not necessary for 
real-world applications. For actual applications, we can use
methods of machine learning for estimating the statistical 
similarity function and then pick a set of exemplars that 
empirically minimizes misclassification rate in a way anal- 
ogous to other nearest neighbor methods. 

4 Motivation 

Now that we have introduced statistical similarity functions, 
we might ask what advantages they could have over either 
models of posterior distributions or nearest neighbor meth- 
ods. The use of statistical similarity functions is somewhat 
analogous to Bayes rule: we apply Bayes rule when we find 
the estimation of $P(x|\omega)$ more convenient than the estimation
of $P(\omega|x)$. In fact, there are several important ways in
which the estimation of $P(S|x,x')$ is more convenient than
estimating class conditional or posterior distributions.

First, and perhaps most importantly, learning $P(S|x,x')$
can be done with unlabeled training data in some important
cases. In 3D visual object recognition, we have a wealth
of unlabeled training data available in the form of motion
sequences. These motion sequences give us different appearances
of the same object in successive frames and can
be used to train $P(S|x,x')$ (see [6] for an example of a system
that takes advantage of this). Furthermore, we would
expect statistical similarity to be able to take advantage of
some properties that are independent of object class; we will
return to this point in the next section. Also, we often have
to solve a set of related classification problems, for which
we keep $P(S|x,x_\omega)$ constant but use different sets of prototypes
$x_\omega$.

The above definitions and theorems are intended to mo- 
tivate the use of statistical similarity and to make a con- 
nection with Bayesian approaches to classification based on 
class conditional and posterior densities. However, having 
a statistical similarity measure available does open up new 
applications that do not fit well into a traditional classifi- 
cation framework. For example, in a case-based reasoning 
framework, information retrieval, or 3D visual object recog- 
nition framework, we may not have a meaningful set of a 
priori classifications. Rather, the goal of the problem is to 
find the case, text, or view that is most likely to have been 
derived from the same underlying situation as the query. 
In fact, several authors have formulated specific statistical
models for statistical similarity in case-based reasoning [1]
and information retrieval (e.g., [?]).



5 An Example 

Consider a classification problem in which the observed
vectors are distributed according to $x \sim x_\omega + N$, where
$x_\omega$ is a class prototype and $N$ is iid noise, independent
of the object class. Then, $P(x|\omega) = N(x - x_\omega)$. Since

$$P(S|x,x_\omega) = P(\omega|x) = \frac{P(x|\omega)\,P(\omega)}{P(x)} = \frac{N(x-x_\omega)\,P(\omega)}{\sum_{\omega'} N(x-x_{\omega'})\,P(\omega')}$$

we see that $P(S|x,x_\omega)$ is translation invariant: if we translate
$x$ and the prototypes $x_\omega$ by the same amount, classification
will be carried out the same way. Furthermore, staying with this
example, if the prototypes $x_\omega$ are displaced by different
amounts $\Delta_\omega$, $P(S|x,x_\omega + \Delta_\omega)$ may not be an accurate
estimate of $P(\omega|x)$ anymore, but the decision rule
$\arg\max_\omega P(S|x,x_\omega + \Delta_\omega)$ can still be seen to be correct.
In practice, $N$ may not be completely independent of $x$, but
if it varies slowly, we can choose models of $P(S|x,x')$ that
take advantage of this fact.
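The translation invariance claimed above can be verified numerically. The following minimal sketch assumes one-dimensional features, Gaussian noise of unit variance, and hypothetical prototypes and priors.

```python
import math

def noise_density(d, var=1.0):
    """Density N(d) of the additive, class-independent noise (assumed Gaussian)."""
    return math.exp(-0.5 * d * d / var) / math.sqrt(2.0 * math.pi * var)

def similarity_to_prototype(x, w, prototypes, priors):
    """P(S | x, x_w) = N(x - x_w) P(w) / sum_w' N(x - x_w') P(w')."""
    numerator = noise_density(x - prototypes[w]) * priors[w]
    denominator = sum(noise_density(x - prototypes[v]) * priors[v] for v in prototypes)
    return numerator / denominator

prototypes = {"a": 0.0, "b": 3.0}   # hypothetical class prototypes
priors = {"a": 0.3, "b": 0.7}       # hypothetical class priors

# Translate both the query and all prototypes by the same amount.
shift = 5.0
shifted = {w: p + shift for w, p in prototypes.items()}

before = similarity_to_prototype(1.0, "a", prototypes, priors)
after = similarity_to_prototype(1.0 + shift, "a", shifted, priors)
print(before, after)
```

Because the expression depends on the query and the prototypes only through their differences, the two values agree.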

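In fact, this last example provides a connection with
adaptive metric models. Consider a simple adaptive metric
model in which we optimize a quadratic form $Q$ for our
metric in order to minimize the error rate; that is, we use as
our decision rule

$$\omega(x) = \arg\min_\omega\, (x - x_\omega) \cdot Q \cdot (x - x_\omega) \qquad (6)$$

If our decision rule is $\omega(x) = \arg\max_\omega P(S|x,x_\omega)$ and
our noise model is a Gaussian $G(0, \Sigma)$, then, by the
above argument,

$$\begin{aligned}
\omega(x) &= \arg\max_\omega P(S|x,x_\omega) && (7)\\
&= \arg\max_\omega \frac{P(\omega)}{P(x)}\, G(x - x_\omega, \Sigma) && (8)\\
&= \arg\max_\omega G(x - x_\omega, \Sigma) && (9)\\
&= \arg\max_\omega e^{-\frac{1}{2}(x - x_\omega) \cdot \Sigma^{-1} \cdot (x - x_\omega)} && (10)\\
&= \arg\min_\omega\, (x - x_\omega) \cdot \Sigma^{-1} \cdot (x - x_\omega) && (11)
\end{aligned}$$

(The step from Equation 8 to Equation 9 drops $P(x)$, which does not
depend on $\omega$, and assumes equal class priors $P(\omega)$.)
By comparing Equation 11 and Equation 6, we see that we
can use $\Sigma^{-1}$ as the quadratic form $Q$ (the choice is not entirely
unique).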
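The equivalence between maximizing the Gaussian density and minimizing a quadratic form can be illustrated with a minimal sketch. It assumes two-dimensional features, equal class priors, $Q$ taken as the inverse of the noise covariance, and hypothetical prototypes; the Euclidean rule is included only for contrast.

```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad_form(v, q):
    """v . Q . v for a 2-vector v and a 2x2 matrix Q."""
    return sum(v[i] * q[i][j] * v[j] for i in range(2) for j in range(2))

def gaussian_density(v, cov):
    """Density of a zero-mean 2-D Gaussian with covariance cov at v."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    return math.exp(-0.5 * quad_form(v, inv2(cov))) / (2.0 * math.pi * math.sqrt(det))

def diff(x, p):
    return (x[0] - p[0], x[1] - p[1])

# Hypothetical noise covariance: large variance along the first axis.
cov = [[4.0, 0.0], [0.0, 0.25]]
prototypes = {"a": (0.0, 0.0), "b": (0.5, 1.0)}
x = (2.0, 0.0)

by_density = max(prototypes, key=lambda w: gaussian_density(diff(x, prototypes[w]), cov))
by_quadratic = min(prototypes, key=lambda w: quad_form(diff(x, prototypes[w]), inv2(cov)))
by_euclidean = min(prototypes, key=lambda w: sum(d * d for d in diff(x, prototypes[w])))
print(by_density, by_quadratic, by_euclidean)
```

The density rule and the quadratic form rule agree, while plain Euclidean distance picks the other prototype because it ignores that displacements along the first axis are much more likely under this noise model.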

6 Character Recognition 

The above ideas were tested on an isolated handwritten
character recognition task using the NIST 3 database [9]
(see also [14] for a state-of-the-art character recognition
system and comparisons of a large number of classifiers).
Similar experiments have been used in other works on variable
and adaptive metric methods (e.g., [?]).

The overall idea is to estimate $P(S|x,x')$ using multi-layer
perceptrons (MLPs) as a simple and well-studied
trainable model of posterior probabilities. Then, we use
$P(S|x,x')$ as our "distance" in a k-nearest neighbor classifier
and compare its performance with the performance of
a standard nearest neighbor classifier.

The images used in these experiments were images of 
handwritten digits from the NIST 3 database. Randomly 
selected images from the first 1000 writers were used for all 
training, randomly selected images from a set of 200 sep- 
arate writers were used for testing. For feature extraction, 
bounding boxes for characters were computed and the char- 
acters were rescaled uniformly to fit into a 40 x 40 image. 
The resulting character image was slant corrected based on 
its second order moments. The uncorrected and slant cor- 
rected images form the first two feature maps. Derivatives 
were estimated along multiples of ^ degrees, resulting in 
five feature maps. Additionally, feature maps of interior re- 
gions, skeletal endpoints, and skeletal junction points were 
computed. Each of the resulting feature maps was anti-aliased
and scaled down to a 10 x 10 grid. This results
in ten 10 x 10 feature maps, or a 1000-dimensional feature
vector. (Experiments were also carried out with subsets
of these feature maps consisting of only the raw image,
100-dimensional, or the raw image, the slant corrected image,
and derivatives, 700-dimensional, with similar results.)
This feature extraction method was chosen because it has 
worked well for character recognition using multi-layer per- 
ceptrons as classifiers [?]; however, there is no reason to 
believe that it is a particularly good representation for the 
purpose of learning statistical similarity functions, and the 
performance of the system can probably be improved by 
experimenting with other feature extraction methods. 

To obtain statistical similarity models, a multi-layer perceptron
(MLP) was trained using gradient descent training.
It has been shown (see [4], p. 304) that training a multi-layer
perceptron under a least square error criterion and binary
output variables results in an approximation to the posterior
probability distribution. The feature vectors from the two
images in a pair were concatenated to yield the feature vector that
formed the input to the MLP. When the classes corresponding
to the two feature vectors in the NIST database were the
same, the target output during training was set to 1, otherwise
to 0.
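The construction of training examples for the similarity model can be sketched as follows. This is a minimal illustration that assumes pairs are drawn uniformly at random with replacement; the function name, the toy features, and the sampling scheme are hypothetical and not taken from the experiments.

```python
import random

def make_pair_examples(features, labels, n_pairs, rng):
    """Build (input, target) pairs for training a model of P(S=1 | x, x').

    The input is the concatenation of two feature vectors; the target is 1
    when both vectors carry the same class label and 0 otherwise.
    """
    examples = []
    for _ in range(n_pairs):
        i = rng.randrange(len(features))
        j = rng.randrange(len(features))
        pair_input = list(features[i]) + list(features[j])
        target = 1 if labels[i] == labels[j] else 0
        examples.append((pair_input, target))
    return examples

# Hypothetical toy data: the first coordinate happens to encode the class.
features = [[0.0, 0.1], [0.0, 0.2], [1.0, 0.3]]
labels = [0, 0, 1]
examples = make_pair_examples(features, labels, 50, random.Random(0))
print(examples[0])
```

Any trainable model of posterior probabilities can then be fitted to these examples; with many classes, uniformly sampled pairs are mostly negative, so in practice the sampling may need to be balanced.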

After estimating a statistical similarity function this way,
the statistical similarity function was used in a simple nearest
neighbor classifier. To select the prototypes for the nearest
neighbor classifier, feature vectors from the training set
were compared to the set of prototypes (initially empty) and
the class associated with the most similar prototype, according
to the statistical similarity function, was returned as the
classification. Whenever the classification was incorrect, the
incorrectly classified feature vector was added to the set of
prototypes. This process was stopped when the set of prototypes
had grown to 200 prototypes.
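The prototype selection procedure described above can be sketched as follows. The similarity function used in the example is a hypothetical stand-in based on distance, not a trained statistical similarity model, and the one-dimensional data are invented for illustration.

```python
def select_prototypes(samples, labels, similarity, max_prototypes):
    """Add every misclassified training sample to the prototype set."""
    prototypes = []  # list of (feature_vector, label) tuples
    for x, label in zip(samples, labels):
        if len(prototypes) >= max_prototypes:
            break
        if prototypes:
            # Classify by the label of the most similar prototype so far.
            _, predicted = max(prototypes, key=lambda p: similarity(x, p[0]))
        else:
            predicted = None  # an empty prototype set always misclassifies
        if predicted != label:
            prototypes.append((x, label))
    return prototypes

def classify(x, prototypes, similarity):
    """Return the label of the prototype most similar to x."""
    return max(prototypes, key=lambda p: similarity(x, p[0]))[1]

def toy_similarity(u, v):
    """Stand-in for a learned P(S=1 | x, x'): decreases with distance."""
    return 1.0 / (1.0 + abs(u - v))

samples = [0.0, 0.1, 5.0, 5.2, 0.2]
labels = ["a", "a", "b", "b", "a"]
prototypes = select_prototypes(samples, labels, toy_similarity, max_prototypes=200)
print(prototypes, classify(4.0, prototypes, toy_similarity))
```

On this toy data only the first sample of each class is misclassified and kept, so the selected set stays much smaller than the training set.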

To estimate misclassification rates, 5000 feature vectors
were selected from a separate test set and classified like
the training vectors (however, misclassified feature vectors
were not added to the set of prototypes). As a control,
the same training and testing process was carried out using
Euclidean distance. The results of these experiments are
shown in Table 1. They show a 2.7-fold improvement of
using statistical similarity over Euclidean distance.

Method                                     Error rate
Statistical Similarity Nearest Neighbor    2.6%
Euclidean Nearest Neighbor                 9.5%

Table 1: An experimental comparison of the performance
of Euclidean nearest neighbor methods with statistical similarity
based nearest neighbor methods. The error rates are
derived from 5000 test samples, using 200 prototypes selected
as described in the paper.

In a second set of experiments, the statistical similarity
function was trained not on randomly selected pairs of feature
vectors, but only on pairs of feature vectors from the
same writer. This means that the statistical similarity function
characterizes the variability for individual writers. For
testing, feature vectors from 200 writers not in the training
set were used. For each writer, the first instance of each
character was used as a prototype, resulting in 10 prototypes
per writer. These prototypes were then used to classify
the remaining samples from the same writer. These
results are shown in Table 2. The results show a 4.4-fold
improvement of statistical similarity over Euclidean nearest
neighbor methods.

Method                                     Error rate
Statistical Similarity Nearest Neighbor    5.1%
Euclidean Nearest Neighbor                 22.6%

Table 2: An experimental comparison of the performance
of Euclidean nearest neighbor methods with statistical similarity
based nearest neighbor methods on a rapid writer
adaptation problem. The error rates are derived from 8767
test samples, using 10 prototypes selected as described in
the paper.

These experimental results demonstrate that using statistical
similarity functions can result in greatly improved
recognition rates compared to Euclidean nearest neighbor
classification methods; statistical similarity functions are an
effective "adaptive metric" for these kinds of problems.
However, that is all these initial experiments were designed
to test, and several important experiments remain to be
done; we will return to this issue in the Discussion.
7 Learning Single View Generalization

As a second problem to be addressed using statistical simi- 
larity, we consider the problem of generalizing the appear- 
ance of a 3D object to novel viewpoints given a single view. 
As an example of statistical similarity, this is an interesting 
problem because the problem cannot be solved as a classifi- 
cation problem. The problem occurs in both psychophysics 
and computer vision and previous approaches to it in the 
literature have postulated very specific representations of 
views and 3D objects in order to admit such generalization; 
a statistical similarity approach like the one presented here 
also yields a simpler and potentially more general theory of 
such single view generalization phenomena. 

In the following, we consider the space of 3D paperclips,
as is frequently used in both psychophysical experiments
and theoretical work on learning in computer vision [19].
That is, a model $M$ is an ordered set of points in $\mathbb{R}^3$. Models
are constructed by concatenating $k = 5$ unit vectors in
$\mathbb{R}^3$ at randomly chosen orientations, meaning that $M$ can
be represented as a vector in $\mathbb{R}^{3k}$. The method in which
models $M$ are constructed randomly gives rise to a prior
distribution $P(M)$ over the space of models.

Given a set of viewing parameters, $V$, we derive a view
$B$ from the model through a parameterized imaging transformation
$B = f_V(M)$; here, the imaging transformation
is assumed to be a rigid body transformation followed by
orthographic projection. After the imaging transformation,
the image $B$ is represented in one of three different ways:
as a list of $(x, y)$ feature locations in the image (in vertex
order), as a list of 2D angles between successive edges, and
as a quantized feature map on a 40 x 40 grid, with each grid
square indicating the presence or absence of a vertex within
that square. These are representations that have been commonly
used in experiments on 3D recognition on paperclips
by previous authors [19].

The viewing transformation can be written as a conditional
density (this is simply expressing the same functional
relationship using the notation of a conditional probability):

$$P(B|V,M) = \delta(B, f_V(M)) \qquad (12)$$

Here, $\delta$ is the Dirac delta function. In the presence of noise
on the location of feature points or the location of model
points, the delta function is replaced by another distribution
related to the noise. For example, under iid additive noise
distributed according to a distribution $N(x)$, the conditional
distribution becomes:

$$P(B|V,M) = N(B - f_V(M)) \qquad (13)$$

If we integrate out the (unobservable) distributions over
noise and viewing parameters, we are left with a marginal
distribution $P(B|M)$, the distribution of views of a model
under these viewing conditions.

Figure 1: An instance of forced choice used in the 3D
recognition experiments.

The viewing parameters were represented by slant and
tilt (rotations around two axes perpendicular to the optical
axis of the observer). Slant and tilt angles were either drawn
uniformly randomly from the interval [-40°, +40°] or from
the set {-45°, +45°}, giving a prior distribution over viewing
parameters, $P(V)$.
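The model and view generation described above can be sketched as follows. This is a minimal illustration under several assumptions that the text does not fix: the chain starts at the origin and the origin counts as a vertex, slant is taken as a rotation about the x axis and tilt as a rotation about the y axis, and the optical axis is z. All function names are hypothetical.

```python
import math
import random

def random_unit_vector(rng):
    """Uniformly distributed direction, from a normalized 3-D Gaussian sample."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:
            return [c / norm for c in v]

def random_paperclip(rng, k=5):
    """Chain k randomly oriented unit segments, starting at the origin."""
    vertices = [[0.0, 0.0, 0.0]]
    for _ in range(k):
        step = random_unit_vector(rng)
        vertices.append([p + s for p, s in zip(vertices[-1], step)])
    return vertices

def view(model, slant_deg, tilt_deg):
    """Rotate about x (slant) and y (tilt), then project orthographically."""
    a, b = math.radians(slant_deg), math.radians(tilt_deg)
    image = []
    for x, y, z in model:
        y1 = y * math.cos(a) - z * math.sin(a)   # rotation about the x axis
        z1 = y * math.sin(a) + z * math.cos(a)
        x2 = x * math.cos(b) + z1 * math.sin(b)  # rotation about the y axis
        image.append((x2, y1))                   # drop the depth coordinate
    return image

def random_viewing_parameters(rng, limit_deg=40.0):
    """Slant and tilt drawn uniformly from [-limit_deg, +limit_deg]."""
    return rng.uniform(-limit_deg, limit_deg), rng.uniform(-limit_deg, limit_deg)

rng = random.Random(1)
model = random_paperclip(rng)
slant, tilt = random_viewing_parameters(rng)
image = view(model, slant, tilt)
print(image)
```

Because the projection discards depth, every projected segment is at most as long as the unit segment it came from, which is one way to see why many models can give rise to the same image.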

Because of the projection involved in the imaging trans- 
form, there is potentially an infinity of models that could 
have given rise to a given image B. For example, all mod- 
els that differ only by their placement of vertices along the 
optical axis after rigid body transformation and the addition 
of noise are indistinguishable from their images. 

We use a forced choice framework of recognition. In the
simplest case, the observer is presented with two views and
has to decide whether they derive from the same 3D model
or not. This kind of visual object recognition problem occurs,
for example, in face verification, where an observer
needs to compare two photographs of faces and perform 3D
generalization based on a single view. A slightly more complicated
forced choice problem is one in which an observer
is presented with a target view and two unknown views, one
of which is known to be derived from the same model as the
target view (condition $S = 1$), and the other of which is derived
from some other randomly chosen model (condition
$S = 0$). We might call this the "police lineup" problem, in
which an observer has to pick out a previously seen instance
from a lineup known to contain an instance. These and similar
forced choice experiments are commonly used in psychophysical
experiments on visual object recognition. They
have in common that there is no task-relevant classification
or categorization of objects: two views derived from two
different models in one experiment may well come from
the same model in another trial. A representative instance
of such a forced choice problem is shown in Figure 1.

Let the target view be $B$ and $B'$ be one of the unknown
views. For concreteness, let us write down the relationship
between $P(S|B,B')$ and the generative model expressed
as $P(B|M)$. In analogy to Equation 2, for $S = 1$, we can
expand in terms of $P(M|B)$ and apply Bayes rule:

$$\begin{aligned}
P(S=1|B,B') &= \int P(M|B)\,P(M|B')\,dM && (14)\\
&= \int \frac{P(B|M)\,P(B'|M)\,P(M)\,P(M)}{P(B)\,P(B')}\,dM && (15)
\end{aligned}$$

Modeling and evaluating this integral and the individual
factors would be a difficult task; even the much
simpler problem of finding the maximum a posteriori
model matching from a fixed set of models, that is,
$\arg\max_{M \in \{M_1,\ldots,M_k\}} P(M|B)$, has proved to be a daunting
computational task (cf. [?]).

However, as in the character recognition example above,
we can model $P(S=1|B,B')$ directly using non-parametric
models of probability distributions. To do this,
we can generate images from various models under different
viewing parameters and train the probability model using
images derived from the same object model as positive
examples and images derived from different object models
as negative training examples. As before, a multi-layer
perceptron (MLP) was used to model posterior probability
distributions.

During training, a fixed set of 200 models was drawn
once from the distribution $P(M)$ and used for generating
images. During testing, new, previously unseen models
were drawn from $P(M)$ and images generated from them
as described above for the forced choice framework. The
performance of statistical similarity was compared with the
performance of Euclidean distance in feature space (equivalent
to a least square match of the images when vertex locations
are used as features). The unknown view closest
to the target according to statistical similarity or Euclidean
distance was returned as the view more likely to have been
derived from the same model as the target (when using statistical
similarity, this is easily seen to be the Bayes optimal
decision rule). The results (Table 3) demonstrate that statistical
similarity reduces the error rate by a factor of between
1.8 and 22 compared to Euclidean distance. Note that in both
cases, the generalization to arbitrary 3D paperclip models was
based on a limited training sample of only 200 paperclips.
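The forced choice decision rule and the way error rates are counted can be sketched as follows. The sketch uses the negative squared Euclidean distance as the similarity, which corresponds to the Euclidean baseline; a trained model of $P(S=1|B,B')$ would be passed in its place. The trial data are invented for illustration.

```python
def forced_choice(target, candidates, similarity):
    """Index of the candidate view judged most likely to match the target."""
    scores = [similarity(target, c) for c in candidates]
    return scores.index(max(scores))

def negative_squared_distance(u, v):
    """Euclidean baseline: larger values mean closer views."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))

def error_rate(trials, similarity):
    """Fraction of trials in which the wrong candidate is chosen.

    Each trial is (target, candidates, index_of_true_match).
    """
    wrong = sum(
        1 for target, candidates, truth in trials
        if forced_choice(target, candidates, similarity) != truth
    )
    return wrong / len(trials)

# Hypothetical trials with views reduced to 2-D feature vectors.
trials = [
    ((0.0, 0.0), [(0.1, 0.0), (2.0, 2.0)], 0),
    ((1.0, 1.0), [(3.0, 3.0), (1.2, 0.9)], 1),
    ((0.0, 1.0), [(0.0, 0.9), (0.0, 3.0)], 1),  # the true match is the farther view
]
baseline_error = error_rate(trials, negative_squared_distance)
print(baseline_error)
```

The third trial shows the failure mode of the baseline: two views of the same model can be far apart in feature space, which is the case a learned similarity is meant to handle.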

8 Discussion 

This paper has introduced the notion of statistical similarity
based on the conditional distribution $P(S=1|x,x') =
\sum_\omega P(\omega|x)\,P(\omega|x')$. It was shown that classification using
a statistical similarity function and a set of unambiguous
exemplars is equivalent to Bayesian minimum error rate
classification. The paper has also brought variable metric
nearest neighbor methods into the framework of statistical
similarity measures. The paper has presented two sets of
experiments.



Features and Model                         Stat. Sim.   Eucl. Dist.
ordered angles, MLP (8:100:1)              10.9%        19.9%
ordered locations, MLP (20:100:1)          0.12%        0.86%
ordered locations, MLP (20:100:1), ±45°    0.38%        8.4%
feature map, MLP (3200:100:1)              7.9%         32%

Table 3: Experiments evaluating MLP-based statistical
similarity relative to view based recognition using 2D similarity.
Error rates (in percent) achieved by MLP-based statistical
view similarity models relative to error rates based
on Euclidean distance (equivalent to 2D similarity in the
case of location features). In all experiments, the training
set consisted of 200 clips consisting each of five vertices.
The test set consisted of 10000 previously unseen
clips drawn from the same distribution. The structure of the
network is given as "n:m:r", where n is the number of
inputs, m the number of hidden units, and r the number of
outputs.



First, it has compared the performance of statistical similarity
with a Euclidean nearest neighbor classifier on two
handwritten character recognition problems. Those experiments
demonstrated a significant improvement relative to
Euclidean nearest neighbor methods. This result should
be considered merely a "sanity check"; it shows that the
method can be used to construct similarity measures that
are significantly better than the baseline of Euclidean distance.
Whether using statistical similarity as an improved
"distance" in a nearest neighbor classifier ultimately will result
in a state-of-the-art character recognition system (e.g.,
[14]) is not clearly addressed by these results. A straightforward
application of k-nearest neighbor classification using
statistical similarity (or Euclidean distances, for that matter)
is impractical anyway because it is too slow. However, the
statistical similarity measure introduced in this paper can
easily be used as part of a hierarchical or partitioned nearest
neighbor classifier, and this is likely to be the best route
towards constructing a statistical similarity based classifier
that can handle the large number of prototypes needed to
achieve state-of-the-art performance. This remains to be
done in future work. Of course, another set of experiments
that would be desirable would be direct comparisons on the
same dataset with adaptive nearest neighbor methods like
those described by [15], [8], [10], [18], [3].

A second set of experiments compared the performance
of statistical similarity with the performance of Euclidean
nearest neighbor methods on a 3D generalization problem in visual
object recognition. This example is interesting because it
lacks a class structure; as shown in [?], it is impossible to
partition a set of 3D models into non-overlapping sets of
views. In this case, similarity is not a means to an end, as
in nearest neighbor classifiers, but it is an essential component
of the problem: the system really needs to be able
to carry out well-founded similarity judgements among objects
in order to perform well. The experiments on single
view generalization using statistical similarity show that it
gives greatly improved performance relative to 2D similarity.
A practical advantage of the statistical similarity approach
to single view generalization compared to previous
approaches [?], [11], [5] is that it does not need to postulate
any kind of special problem structure (hierarchical feature
extraction, interpolable prototypes, class membership).

Overall, this paper has outlined the beginnings of a 
Bayesian theory of learning similarity. As we noted in the 
introduction, statistical similarity is already implicitly mak- 
ing an appearance in a number of areas of computer vision, 
pattern recognition, and information retrieval. Perhaps its 
most important contribution is to show that notions of simi- 
larity that have previously been discussed in the form of ge- 
ometrically motivated "distance measures" or that are based 
on dyadic probability models having specific parametric
forms can be understood in, and unified under, a general 
Bayesian view. 

In the future, it will be important to see whether other 
forms of statistical similarities may be easier to estimate 
or manipulate; for example the conditional distribution 
P{B'\B,S) is an alternative to P{S = l\B,B') and has 
some computational advantages. While the framework of 
statistical similarity allows us to plug in arbitrary classi- 
fiers and features, some classifiers and feature types may 
turn out to be better suited to these kinds of problems. It 
has taken many years for the community to gain experience 
with this in the context of traditional classification prob- 
lems, and it will likely take some time to gain similar expe- 
rience for statistical similarity. As part of this, much more 
extensive benchmarking and performance evaluations than 
could be presented here will be needed. Some of the most
promising applications of statistical similarity are on problems
where existing classification-based approaches do not
apply at all (e.g., face verification, some information retrieval
problems), or where rapid adaptation to novel styles
or problems is needed (e.g., multi-font OCR, on-line handwriting
recognition).